Best Evaluation Set AI Tools & Models - Premium Evaluation Set News

AI News

China Academy of Information and Communications Technology Releases Fangsheng 3.0 Large Model Benchmark Test

CAICT has released version 3.0 of its "Fangsheng" AI evaluation system. The release adds tests of basic model attributes, systematically evaluates parameter scale and inference efficiency, and introduces ten forward-looking advanced-intelligence tests, including full-modal understanding and long-term memory, to advance AI evaluation in China.

9.7k 6 hours ago

Shanghai Accelerates the Application of AI Technology in the Medical Equipment Field, Promoting the Development of the High-End Industry Chain

8.8k 12-03

Major Shake-up in the Open-Source Ecosystem! Ant Releases AI Project Panorama 2.0, 114 Projects Witness the Wave of Technological Transformation

The open-source AI ecosystem is undergoing an unprecedented transformation. At the Bund Conference, Ant Group officially released version 2.0 of its panorama of large-model open-source development and trends, which acts as a mirror reflecting the true state of this rapidly evolving field. The panorama is not a simple collection of data but the result of careful selection using the rigorous OpenRank evaluation system. The research team set a threshold of an OpenRank score above 50 and assessed each project's relative influence by analyzing the collaboration relationships between projects, selecting from across the vast open-source landscape.

11.9k 4 days ago
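The threshold step described in the Ant Group item above can be sketched as a simple filter. This is a minimal illustration only: the `Project` type, the `select_projects` helper, and the sample scores are hypothetical, and the computation of OpenRank itself (which the item says is based on collaboration relationships) is not shown.

```python
from dataclasses import dataclass

# Threshold stated in the news item: OpenRank greater than 50.
OPENRANK_THRESHOLD = 50


@dataclass
class Project:
    name: str        # hypothetical project identifier
    openrank: float  # assumed to be a precomputed OpenRank score


def select_projects(projects, threshold=OPENRANK_THRESHOLD):
    """Keep projects whose score is strictly above the threshold, highest first."""
    kept = [p for p in projects if p.openrank > threshold]
    return sorted(kept, key=lambda p: p.openrank, reverse=True)


# Made-up sample data, for illustration only.
sample = [
    Project("project-a", 120.5),
    Project("project-b", 50.0),   # exactly 50 is not "greater than 50"
    Project("project-c", 73.2),
]

selected = select_projects(sample)
print([p.name for p in selected])  # ['project-a', 'project-c']
```

The comparison is strict because the item describes the cutoff as "greater than 50"; whether the original selection treated a score of exactly 50 as passing is not stated.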

GPT-5 Rises to the Top of the LMArena Ranking: Setting a New Record in AI Model Evaluation

10.9k 4 hours ago

AI Products

Patronus GLIDER

A general evaluation model for assessing text, dialogue, and RAG settings.

AI model

6.9k

Models

Model                                  Vendor     Input tokens/M  Output tokens/M  Context Length
Doubao-1.5-pro-32k                     Bytedance  $0.8            -                128
Hunyuan-Large-Vision                   Tencent    -               -                -
QianfanHuijin-Reason-8B                Baidu      -               -                -
Hunyuan-Functioncall                   Tencent    -               -                -
Hunyuan-TurboS-Longtext-128k-20250325  Tencent    $1.5            -                128
Hunyuan-Standard                       Tencent    $0.8            -                -
Baichuan-M2-32B                        Baichuan   -               -                -
ERNIE X1.1 Preview                     Baidu      -               -                -
Qwen_v2.5_0.5b_Instruct                Alibaba    -               -                128
Hunyuan-Lite                           Tencent    -               -                250
Spark Max                              Iflytek    -               -                -
ERNIE-4.5-0.3B                         Baidu      $0.1            $0.4             128
GLM-4-Plus                             Chatglm    $100            $100             128
CogView-3                              Chatglm    -               -                -
Yi-34B-200K                            01-ai      -               -                200
Yi-34B                                 01-ai      -               -                -

(- = not listed)
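As a rough illustration of how the listed prices translate into a per-request cost, the sketch below assumes that "tokens/M" means a price per one million tokens and that input and output tokens are billed separately. The `estimate_cost` helper and the token counts are hypothetical; the prices used are the ERNIE-4.5-0.3B figures from the listing ($0.1 input, $0.4 output).

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Estimate the cost of one request from per-million-token prices."""
    input_cost = (input_tokens / 1_000_000) * input_price_per_m
    output_cost = (output_tokens / 1_000_000) * output_price_per_m
    return input_cost + output_cost


# Hypothetical workload: 200,000 input tokens and 50,000 output tokens,
# priced with the ERNIE-4.5-0.3B figures shown in the listing.
cost = estimate_cost(200_000, 50_000, input_price_per_m=0.1, output_price_per_m=0.4)
print(f"${cost:.4f}")  # $0.0400
```

Entries that list no output price cannot be costed this way without making a further assumption about the missing figure.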

MCP

Devtools Debugger Mcp

The Node.js Debugger MCP server provides full debugging capabilities based on the Chrome DevTools Protocol, including setting breakpoints, stepping through execution, inspecting variables, and evaluating expressions.

typescript

10k

4.0 points

The Mcp Company

OpenHands clone project for AI agent evaluation, supporting browser tools, oracle tool sets, and tool retrieval functions

python

2.5 points

AWorld

AWorld is a multi-agent system (MAS) framework that aims to bridge the gap between theoretical MAS capabilities and practical applications, providing a complete solution that ranges from single-agent operation to multi-agent collaboration and competition. The project supports scenarios such as browser and mobile operations and GAIA benchmark testing, adopts a client-server architecture, integrates a rich toolchain, and includes performance evaluation and training functions.

python

6.6k

2.0 points

Models

VideoMAE_kinetics_wlasl_100__signer_20ep_coR

VideoMAE_kinetics_wlasl2000_20epoch_signer

VideoMAE_kinetics__wlasl_2000_20epoch

RoBERTA Joke Rater

VideoMAE_base__wlasl_100_20epoch

Llama71b Mentalchat16k

Mt5 Small Finetuned Amazon En Es

VideoMAE_Base_wlasl_100_longtail_200

VideoMAE_Base_WLASL_100_200_epochs_p20_SR_8

Sweagent Qwen Coder 32b 3epochs 32k 5e 5

Correction

Leadscanr JobClassifier Domain

Leadscanr MessageClassifier Type

Ctsinov1

License Plate Detr Dinov3

Simia Tau SFT Qwen3 8B

Simia Tau SFT Qwen2.5 7B

Simia Officebench SFT Qwen2.5 7B

ScamGuard

Cwe Parent Vulnerability Classification Roberta Base
